Introduction - Traditionally, chemical information systems have relied upon two-dimensional (2D)
connectivity tables as a means to represent chemical structures. Graph theory (i.e., subgraph
isomorphism algorithms, clique detection methods), sometimes coupled with distance geometry
approaches (e.g., difference distance matrix), has been and continues to be used for structure,
substructure and similarity searching driven by geometric or property-based queries (1-5).

Recent advances in computer technologies and dramatic cost reductions for physical 
memory and mass storage devices now allow for extensive data storage. Simultaneously, modern
workstations have undergone a major expansion in computational power, providing the capability to 
access the hundreds of thousands of data records needed to search and analyze structural as well 
as conformational information. As a result, impressive progress in the development of automated 
methods for generating three-dimensional (3D) structures from traditional 2D chemical diagrams 
has occurred. 

The advent of straightforward 2D to 3D structural database conversions has prompted the 
simultaneous development of methods to retrieve, classify (or cluster), and analyze structural 
information, especially as they relate to the interrelationships of structures to biological activities. 
Recent biological technologies have further emphasized the fundamental value of molecular 
structures, constitutive building blocks that can clarify drug-receptor associations and help 
understand biological functions of receptor proteins. As reviewed in this chapter, structural 
databases can provide invaluable resources to medicinal chemists, offering an expanding range of 
tools to enhance the drug discovery process, including methodologies in information retrieval, 
knowledge-based strategies, multiple docking procedures, and de novo design approaches. 

STRUCTURAL DATABASES AND SIMILARITY SEARCHES 

Similarity searches initially involved geometric matching of structural groups, a simple 
strategy using query patterns based on the atom connectivity of the 2D chemical structures stored 
in databases. For example, searches are commonly used to verify that a particular chemical 
structure is indeed novel (e.g., patent or spectral search)(6). Common database searching 
strategies have been reviewed in the recent literature (1,4,7). 

2D Database Searches - Such database searches are global in nature, since the query is meant to
identify structures that present total or partial similarity. When dealing with structures that relate to
biological activities, however, the database queries need to incorporate geometric features that
reflect the presence of particular structural elements that appear to affect the biological activity,
i.e., the pharmacophore or pharmacophore moieties. A pharmacophore can thus consist of
non-connected atoms spread throughout an active structure. It can also be an ensemble of atom types
and their distance relationships, pharmacophore moieties (e.g., presence of a particular heterocyclic
moiety, H-bond acceptor or donor, positive or negative charges), or receptor site points. Database
searching must therefore provide the flexibility to identify crucial substructural or geometric
elements. The methods of subgraph isomorphism and distance geometry have attempted to
address these needs, as amply reviewed in the recent literature (4,5,8).

The majority of the 2D searching methods retrieve only exact query matches, overlooking 
potential hits (9). Newer methods are currently being developed to overcome this limitation. For 
example, a two-step approach involving unique subsimilarity screening followed by an approximate 
maximal common subgraph matching step was shown to find substructures that nearly match (i.e.,
"fuzzy match") the initial query (9).

2D to 3D Database Conversion - The program MOLPAT was among the first to introduce 3D 
concepts in database searching (10). Evolving from 2D substructure search systems, the first
generation of 3D similarity searching systems was aimed at a structural geometry similarity search
based on an input target structure. The database search procedure involved a two-level retrieval
algorithm in which the time-consuming geometric search was preceded by a rapid screening
search based on interatomic distance ranges (4,5,8). The atom mapping technique is an example
in which the degree of resemblance between pairs of 3D structures is calculated from their
interatomic distance matrices (8,11,12). The computational efficiency of the method was recently
assessed and compared to others (8,11).
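
The two-level retrieval idea can be illustrated with a minimal sketch. It is not the algorithm of any of the cited programs: the function names, the form of the distance-range screen, and the root-mean-square comparison of distance matrices are assumptions chosen for illustration, and the atom ordering of the two structures is assumed to be already matched.

```python
import math
from itertools import combinations


def distance_matrix(coords):
    """Full interatomic distance matrix for a list of (x, y, z) tuples."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def passes_screen(query_ranges, coords):
    """Rapid first-level screen (hypothetical form): every (low, high) range in
    the query must be met by at least one interatomic distance of the candidate."""
    dists = [math.dist(a, b) for a, b in combinations(coords, 2)]
    return all(any(lo <= d <= hi for d in dists) for lo, hi in query_ranges)


def matrix_dissimilarity(dm_a, dm_b):
    """Second-level comparison: root-mean-square difference between corresponding
    interatomic distances (assumes identical atom ordering and at least two atoms)."""
    n = len(dm_a)
    diffs = [(dm_a[i][j] - dm_b[i][j]) ** 2 for i, j in combinations(range(n), 2)]
    return math.sqrt(sum(diffs) / len(diffs))


# Toy data: the candidate is a translated copy of the target.
target = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
candidate = [(1.0, 1.0, 1.0), (4.0, 1.0, 1.0), (1.0, 5.0, 1.0)]
query_ranges = [(2.9, 3.1), (4.9, 5.1)]

screened_in = passes_screen(query_ranges, candidate)
rms = matrix_dissimilarity(distance_matrix(target), distance_matrix(candidate))
```

Because interatomic distances are invariant to translation and rotation, the translated copy passes the screen and its dissimilarity to the target is zero. Only candidates that survive the inexpensive screen would be passed to the costlier comparison.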

The emergence of powerful molecular graphics systems has intensified the desire to expand
database retrieval mechanisms beyond 2D molecular connectivities to incorporate 3D properties in
the queries. Today, most database retrieval systems have been upgraded to handle 3D structures, as
exemplified by the recent availability of software tools and databases from various sources such as
the Cambridge Structural Database (CSD)(13) with the QUEST3D and VISTA programs (14), Molecular
Design Limited with MACCS3D (6), Chemical Design Ltd. with CHEMDBS3D (15), Tripos
Associates Ltd. with UNITY (16), the Chemical Abstracts Service (CAS)(17), and the Fine Chemicals
Directory (FCD)(6).

The rule-based program CONCORD (18) is the most popular procedure for converting existing
2D databases (19). Using artificial intelligence methods, an alternative approach has been
developed in COBRA (20), a successor to WIZARD. This program decomposes structures into
"simple conformational units" and progressively assembles reasonable "subconformations" for each
unit. The conformational space of a molecule is then represented as a tree-like structural assembly
which lends itself to graph-searching techniques. Like all other knowledge-based systems, this
method is limited by the initial dictionary of conformational units considered. Efforts to incorporate
distance geometry template generation to address this limitation are anticipated.

3D Conformational Flexibility - An important issue in 3D database system development rests with 
the ability to rapidly and efficiently address conformational flexibility for 3D structures converted 
from a 2D connectivity table. At this point, there is no universally accepted solution to this problem. 
For example, conformational information may be stored explicitly in the database in the form of 
multiple conformations or implicitly in the form of interatomic distance ranges. In the explicit 
approach, it becomes necessary to preload multiple conformations into a database. Major 
difficulties lie in the selection and number of conformations to retain for each structure. Not only 
does this impact the size of the database and the speed of the search, but also the successful 
identification of relevant conformations (7). In the implicit approach, the conformational flexibility is 
inferred by multiple interatomic distance ranges, involving distance geometry algorithms, and is 
widely illustrated in programs such as ALADDIN (21) or EMBED (22). The use of bounded distance 
matrices, a combination of distance geometry and graph theory methods, was shown to be an 
efficient means to address the conformational flexibility while matching pharmacophoric patterns in 
a 3D database (23). Conformational flexibility may also be taken into account while the database 
search is being performed. Flexible search queries or "on-the-fly" conformational analyses have 
been reported (24). The Sybyl UNITY database system illustrates another recent initiative to
incorporate newer methods such as a tweak algorithm that allows for conformational expansion (16).
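
The implicit approach can be sketched as follows. This is an illustrative assumption about how bounded distance matrices might be used, not the implementation of ALADDIN, EMBED, or the method of reference 23: upper bounds are tightened with the triangle inequality, and a pharmacophore query is considered compatible when each query distance range overlaps the stored range for the mapped atom pair. All names are hypothetical.

```python
def smooth_upper_bounds(upper):
    """Tighten upper distance bounds with the triangle inequality,
    u[i][j] <= u[i][k] + u[k][j], in the manner of Floyd-Warshall."""
    n = len(upper)
    u = [row[:] for row in upper]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if u[i][k] + u[k][j] < u[i][j]:
                    u[i][j] = u[i][k] + u[k][j]
    return u


def query_compatible(lower, upper, mapping, query):
    """Return True if every query range overlaps the stored distance range.

    mapping: query point index -> atom index in the stored structure
    query:   {(p, q): (qmin, qmax)} distance ranges between query points
    """
    for (p, q), (qmin, qmax) in query.items():
        i, j = mapping[p], mapping[q]
        if qmax < lower[i][j] or qmin > upper[i][j]:
            return False
    return True


# Toy three-atom chain: bonded pairs are fixed at 1.5, the 1-3 distance is flexible.
lower = [[0.0, 1.5, 2.0], [1.5, 0.0, 1.5], [2.0, 1.5, 0.0]]
raw_upper = [[0.0, 1.5, 10.0], [1.5, 0.0, 1.5], [10.0, 1.5, 0.0]]
upper = smooth_upper_bounds(raw_upper)

reachable = query_compatible(lower, upper, {0: 0, 1: 2}, {(0, 1): (2.5, 3.5)})
unreachable = query_compatible(lower, upper, {0: 0, 1: 2}, {(0, 1): (4.0, 5.0)})
```

In this toy case the smoothed 1-3 upper bound becomes 3.0, so a query asking for a separation of 2.5 to 3.5 remains feasible while one asking for 4.0 to 5.0 is rejected without generating any explicit conformation.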

STRATEGIES IN 3D SIMILARITY SEARCHING 

In the absence of a receptor structure, 3D databases can be searched indirectly, matching
the known 3D structures of various ligands on the basis of 3D similarity. This computerized
screening is used to identify structural leads (25).

Grid-based Similarity - As originally introduced by Carbo, molecular similarity between two 
molecules can be determined from the electron density derived from quantum mechanical 
calculations (26, 27). The technique has since been extended to use other structural properties
such as electrostatic potentials, electric fields, and shape. In the program ASP, for example,
molecules are surrounded by a rectilinear grid; the structural property is evaluated at each grid
intersection, and integrals are evaluated numerically or by utilizing Gaussian functions rather than
grids (28-32). The similarity index resulting from these calculations provides a means to
quantitatively relate the molecular similarity to the observed biological activities. In the program 
MEPCONF, molecular similarity is derived from the molecular electrostatic potential (MEP) mapped 
on a grid surface that is proportional to the van der Waals surfaces of two molecules (33). Both 
rigid rotations and translations of one system relative to another and internal conformational 
expansion of the relative molecules are carried out in an iterative procedure to maximize the 
similarity. The similarity is measured using the Spearman rank correlation coefficient on pairs of
points on the common, coincident grid computed at each iteration.
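
A rank correlation of this kind can be sketched as below. The code is a generic Spearman coefficient with average ranks for ties, computed on two lists of property values sampled at the same grid points. It is not taken from MEPCONF, and the function names and toy values are assumptions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman coefficient: Pearson correlation of the two rank vectors.
    Undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Hypothetical potentials of two molecules at four coincident grid points.
mep_a = [0.1, 0.5, 0.9, 1.4]
mep_b = [1.0, 2.0, 8.0, 30.0]
rho = spearman(mep_a, mep_b)
```

Because only the ordering of the values matters, the two toy potentials correlate perfectly even though their magnitudes differ widely, which is the property that makes a rank-based measure attractive for comparing potentials of different scale.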

Surface and shape similarity - The 3D similarity can also be derived from molecular surfaces or 
shape analyses. For example, a 3D shape definition has been described by considering points on 
spheres that surround the molecules (3,36). In this gnomonic projection, 3D molecular properties 
such as size and electrostatic potentials can be assigned as scalar or vector functions. 

In a distinct approach, SPERM, an icosahedral mapping and matching procedure (37), has
been used to specify molecular properties surrounding molecules (38, 39). 3D molecular shape is
the primary matching parameter and requires an initial superposition of the centroids or centers of
mass of the molecules compared. A 3D space scanning is applied to the molecule being compared to a
reference template, leading to a shape similarity score evaluated from the sum of squares of the
differences (SSD). Any chosen molecular property such as molecular electrostatic potentials or
electron density can then be further quantified using either Carbo (26, 27) or Hodgkin (29) similarity
indices.
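
For a property P sampled at the same points for molecules A and B, the Carbo index is commonly written as the sum of P_A P_B divided by the square root of (sum of P_A squared) times (sum of P_B squared), and the Hodgkin index as twice the sum of P_A P_B divided by (sum of P_A squared plus sum of P_B squared). The sketch below evaluates both on discrete samples; replacing the integrals with plain sums over grid points, and the function names, are simplifying assumptions.

```python
import math


def carbo_index(pa, pb):
    """Carbo index on sampled values: sensitive to the shape of the
    property distribution but not to its overall magnitude."""
    overlap = sum(a * b for a, b in zip(pa, pb))
    return overlap / math.sqrt(sum(a * a for a in pa) * sum(b * b for b in pb))


def hodgkin_index(pa, pb):
    """Hodgkin index on sampled values: also penalizes differences in magnitude."""
    overlap = sum(a * b for a, b in zip(pa, pb))
    return 2 * overlap / (sum(a * a for a in pa) + sum(b * b for b in pb))


# Toy property values: molecule B has the same pattern as A at twice the magnitude.
prop_a = [1.0, 2.0, 3.0]
prop_b = [2.0, 4.0, 6.0]

carbo = carbo_index(prop_a, prop_b)      # 1.0
hodgkin = hodgkin_index(prop_a, prop_b)  # 0.8
```

The toy values show the practical difference between the two indices: a uniformly scaled property gives a Carbo index of 1.0 but a Hodgkin index of 0.8.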

Another method has been derived from the use of approximate wavefunctions (40). The 
shape of a molecule is defined by exploring the molecular surface with a helium atom probe. The 
calculation involves the evaluation of a simple expression for the repulsive interaction between 
molecule and probe. The similarity index is calculated from the overlap integrals between the
orbitals of one molecule and those of the other.

Following an initial atom-matching alignment, the molecular shapes have been used to 
optimize the alignment (41). A surface comparison is evaluated from all surface dots within a 
certain distance of the corresponding dots of the mean surface. It has been used in an effort to 
compare biologically active or inactive compounds belonging to a similar structural type. 

Two hashing schemes, one based on connectivity and the other based on internal 
distances, have been applied to the problem of similarity and clustering of shape data (42). 

Shape similarities have traditionally used geometrical representations, but topological 
representations have also been explored (43). The shape group method has been extended to 
utilize density domains of molecules analyzed at different electron density values (44). The electron 
density is calculated by an ab initio method. The shape codes generated in the analysis can then 
be used to provide a similarity measure for two molecules in a topological rather than a geometrical 
description. 

Pharmacophore-based Similarity - In the majority of the 3D similarity studies, a preliminary 
molecular alignment is necessary. Usually, geometric features derived from pharmacophore 
hypotheses have been used to guide manual superimpositions of corresponding molecules, 
sometimes handled through molecular graphics manipulations. Recent studies illustrate the trend to 
automate molecular superpositions, building upon the 3D database retrieval techniques to limit 
biases often introduced by individual users. 

In the program SUPER, the correspondence of molecules is achieved through optimum 
superposition of two molecules using their overlap integrals (45). SUPER is designed to identify the 
best twenty superpositions by generating a network of grid points at the van der Waals surfaces of 
the two molecules and assigning united atom potentials to these points. The search for the 
superposition is carried out by rigid body translations and rotations such that each atom of molecule 
1 is eventually paired with each atom in molecule 2. SUPER discards matches of points if the 
difference between two potentials is above a certain set value. This is also used as a measure of 
the goodness of fit.

... were found to inhibit chymotrypsin. In a more recent version of the program, DOCK2, better
sampling and a more systematic searching of the orientation space have been reported (60).
DOCK2 also includes a lattice-based method for evaluating the goodness of fit. These new features
have been assessed in a variety of applications. DOCK was recently applied to the crystal structure
of L. casei thymidylate synthase (TS) using the FCD (61). The TS substrate deoxyuridine
monophosphate (dUMP), and several known nucleotide and non-nucleotide inhibitors of the
enzyme, were identified and adequately scored by the computational screen. Additional molecules
not previously known to bind TS (e.g., sulisobenzone) were also retrieved from the database and
shown to inhibit the TS enzyme.

An approach that uses bidimensional surface profiles to find molecular shape complementarity
reduces the complex 3D surface to a 2D representation called "angular profiles". The observed
crystal-bound conformations of kallikrein and bovine pancreatic trypsin inhibitor were reproduced,
but not as the best-matching solution, suggesting that this method could be used as an initial
step toward a 3D study of the shape complementarity between two molecular surfaces (62).

In an alternative approach, patterns of points, or webs, are generated to represent the two
molecular surfaces (63). McLachlan's least squares fitting method (64) is then used to match 
corresponding webs in the two surfaces. Local complementarity and van der Waals interaction 
energy are used as filtering criteria to select multiple docking orientations. These are then scored 
based on electrostatic interaction energy. The authors suggest that this method is amenable to 
coarse dihedral sampling to address conformational flexibility of the ligand. 

Another method that uses similarity in the design of compounds involves determining the 
electrostatic properties of the ligand in the field of the receptor. If electrostatic characteristics are 
important, then their determination will allow searches for molecules that have similar charge 
characteristics. A program, YING, determines the point charges of a ligand bound to a receptor 
based on maximizing the complementarity of the charges of the ligand and receptor (65). The 
method uses the van der Waals surface of the ligand, the electrostatic potential of the receptor at 
that surface, and the partial charges of the ligand atoms. The program then maximizes the charge 
complementarity while still maintaining any formal charge associated with the ligand. Ligand and 
receptor atoms are kept fixed during this process. 

Metropolis Monte Carlo Method - In the program DISDOCK, a simulated annealing procedure is 
applied to position and orient the interacting molecules, which are both treated as rigid bodies (66). 
The distance constraints that are necessary to guide the relative intermolecular orientation are 
derived from a selection of atom pairs such as a hydrogen donor and its complementary acceptor 
atom. When applied to a set of serine proteinase complexes, the best results are obtained when 
known distance constraints are available. The results also point to a dependence on the starting 
ligand conformation. 

In AUTODOCK, a Monte Carlo conformational analysis is combined with rapid energy 
evaluation using molecular affinity potentials (67). A limited number of rotatable bonds can be 
randomly varied in the ligand structure. Although this approach does not require preliminary 
knowledge of the binding site, it appears that the substrate binding modes can be affected by the 
starting orientation. In an alternative strategy, rigid molecular fragments are docked into the binding 
site (68). The unique feature of this method is the ability of buried fragments to float to the binding 
pockets from inside a grid representation of the receptor as opposed to placing the ligand at some 
arbitrary position well outside of the receptor grid. This is accomplished by a scoring function that 
measures the average distance to the surface points on the grid. Simulated annealing is then used 
to complete the simulation. 

Another method utilizes a two-step procedure. A predocked ligand is submitted to an initial
coarse sampling (i.e., rigid body translation and rotation) in positional space with limited 
conformational exploration. A Metropolis Monte Carlo (69) minimization is then applied with 
complete freedom of movement of the ligand to optimize the structures obtained from the first step. 
The receptor is rigid throughout the simulation and the method does not use a grid/surface based 
representation (70). 
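
The Metropolis criterion that underlies these procedures can be illustrated with a minimal sketch. It is not the implementation of DISDOCK, AUTODOCK, or any other cited program: the toy quadratic energy, the geometric cooling schedule, the step size, and all names are assumptions, and the "ligand" is reduced to a single point moving in a rigid field.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Accept downhill moves always; accept uphill moves with
    probability exp(-delta_e / temperature)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def anneal(energy, start, steps=2000, t_start=5.0, t_end=0.01, step_size=0.5, seed=0):
    """Simulated annealing over a position vector; returns the best state seen."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy(current)
    best, e_best = current[:], e_current
    for k in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        t = t_start * (t_end / t_start) ** (k / (steps - 1))
        trial = [x + rng.uniform(-step_size, step_size) for x in current]
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, t, rng):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = current[:], e_current
    return best, e_best


def toy_energy(position, pocket=(1.0, -2.0, 0.5)):
    """Hypothetical energy: squared distance from an assumed pocket center."""
    return sum((x - c) ** 2 for x, c in zip(position, pocket))


start = (4.0, -3.0, 2.0)
best_position, best_energy = anneal(toy_energy, start)
```

At high temperature most uphill moves are accepted, which lets the search escape the starting placement; as the temperature falls the walk becomes increasingly greedy. The dependence on the starting orientation noted above corresponds, in this sketch, to the choice of `start` and `seed`.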

Protein-Protein Docking - Although the recent literature emphasizes docking procedures that involve 
a small molecule ligand binding to a macromolecule, molecular complementarity also applies to 
protein-protein recognition phenomena. Early attempts in protein-protein docking relied either on 
electrostatics or geometric description of the molecular surfaces (71).

The DOCK program has been extended to the area of rigid body protein-protein docking
(72). In a similar approach, simplified protein models with one sphere per residue are subjected to
simulated annealing using a crude energy function where the attractive component is proportional to
the interface area (73). The procedure finds clusters of orientations in which a steric fit between the
two protein components is achieved over a large contact surface.

In another approach, each protein is reduced to sets of surface and internal volume grid 
points (74). Molecules are then docked by matching surface grid points. Optimal orientations are 
those with maximum matching surface points and minimum overlapping volume cubes. 
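
A grid-matching score of this kind can be sketched as follows. The sketch is an assumption-laden illustration rather than the method of reference 74: each protein is given as sets of integer grid cells, only translations are searched (rotations are omitted), and the penalty weight for overlapping volume cells is arbitrary.

```python
from itertools import product


def shift(cells, offset):
    """Translate a set of integer grid cells by an (dx, dy, dz) offset."""
    dx, dy, dz = offset
    return {(x + dx, y + dy, z + dz) for x, y, z in cells}


def score(surf_a, vol_a, surf_b, vol_b, offset, penalty=10):
    """Matching surface cells minus a penalty for overlapping volume cells."""
    matches = len(surf_a & shift(surf_b, offset))
    clashes = len(vol_a & shift(vol_b, offset))
    return matches - penalty * clashes


def best_translation(surf_a, vol_a, surf_b, vol_b, search=3):
    """Exhaustive search over integer translations within +/- search cells."""
    offsets = product(range(-search, search + 1), repeat=3)
    return max(offsets, key=lambda o: score(surf_a, vol_a, surf_b, vol_b, o))


# Toy one-dimensional "proteins": A occupies x = 0..2 with a surface cell at x = 3;
# B has a surface cell at x = 0 and occupies x = 1..2.
surf_a = {(3, 0, 0)}
vol_a = {(0, 0, 0), (1, 0, 0), (2, 0, 0)}
surf_b = {(0, 0, 0)}
vol_b = {(1, 0, 0), (2, 0, 0)}

best = best_translation(surf_a, vol_a, surf_b, vol_b)
```

In the toy case the only translation that brings the two surface cells into coincidence without interior overlap is (3, 0, 0), which is the orientation returned; leaving B at the origin instead produces two clashing volume cells and a strongly negative score.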

DE NOVO DESIGN

Ligand Structures - The program CAVEAT was among the first to utilize 3D database geometry 
features to retrieve novel structural templates (75). In a recent application, the CSD was searched 
to extract structures in which specified bond directionalities were maintained at the correct distances 
and angles to identify structural peptidomimetic templates. In a related approach, the 
FOUNDATION database system retrieves molecules based upon matching some minimum subset 
of elements of the query (76). The query elements include atom position and type, bonded and 
nonbonded atoms, RMS fit, subsets with an occupancy range, and volume restrictions. By using 
appropriate atom type definitions, general characteristics such as hydrogen bond donor/acceptor 
and hydrophobicity can be included in the query. The unique feature in FOUNDATION is the ability 
to search for any combination of some specified lower limit of the elements that comprise the query. 
This "fuzzy match" allows the retrieval of molecules that do not fit the entire query. Such partial
matches can be used as starting points in the design of novel ligands, or in the selection of
compounds to be tested in an assay. Since molecular fragments can be in the input, searches for
bridging elements to link
the important elements of a pharmacophore hypothesis can be run. The resulting hit list can be 
ranked by number of matching atoms, RMS-fit, or steric complementarity if a volume constraint is 
used. 

In the program GROW, peptidic fragments are placed and connected within the confines of 
the receptor active site (77). The template library contains amino acid fragment units in their 
multiple low energy conformations. From an initial seed point, all template fragments are evaluated 
for binding site fit. Scoring is based on a molecular mechanics potential, including a solvation term. 
The growth can take place without any restrictions or can be directed by the use of several control 
parameters. 

LUDI is a rule-based system that makes use of contact patterns found in the CSD to 
perform automatic design of novel compounds (78). LUDI looks for hydrogen-bonding interactions
with appropriate distance and angle characteristics, as well as lipophilic-aliphatic and
lipophilic-aromatic interactions. The fragments employed are small molecules progressively
connected to fit a binding
site. LUDI is now capable of starting from an existing fragment, allowing incorporation of features 
extracted from a known ligand. The scoring function takes into account the number and quality of 
hydrogen bonds and also the contact surface area. 

Another method, DELEGATE (within the software package BUILDER), generates an 
irregular lattice of points within an active site (79). First a DOCK run is performed to place a number 
of appropriate molecules within the active site. This collection of molecules is used to construct the 
irregular lattice, fusing the atoms of the DOCK structures with correct distance and angles into a 
composite structure that fits the active site. Additional minimization is required to regularize the 
structure. Electrostatic, hydrogen bonding, and shape complementarity characteristics can be 
accommodated. This lattice can also be used manually to view the disconnected structures and
prune/connect atoms to form a composite structure. 

The database system CLIX (80) utilizes a series of GRID (81) runs to determine where 
various probes would have favorable interactions with a target receptor. This produces an 
ensemble of possible sites of interaction. The database is comprised of structures from the CSD. 
CLIX searches for at least three coincident favorable points of interaction between the grid and the 
ligand with appropriate steric fit. The fits are scored based on the sum of the interaction energies. 

GenStar is designed to grow tetrahedral atoms into an active site (82). The input controls
include the heavy atoms of interest at the active site, a seed point (either from the enzyme or from a
known inhibitor), and the maximum number of atoms required in the compound to be generated. A
"closeness grid" is calculated around the seed point to check for steric contacts. A scoring scheme
similar to that used in DOCK includes the capability to detect potential hydrogen bonding sites.

LEGEND builds a structure sequentially, starting from randomly selected atom types
positioned with random torsion angles (83). Intermolecular interactions are neglected during the
atom generation process. A candidate atom is selected automatically if it does not bump into either
the enzyme or any previous atoms in the growing candidate molecule. After a structure is complete,
charges are assigned to all atoms and the structure is energy minimized in the active site. From the
many structures generated by LEGEND, a separate post-processing program, LORE, may be used
to select the more interesting structures for graphical analysis.

In preparation for site-directed drug design, the CSD has been used to retrieve aliphatic 
fragments and their respective bond properties (84-86). A new algorithm to create diverse, 
irregular, and physically reasonable 3D linear atomic chains or molecular graphs has been 
described (87). These can be used to generate structural templates for joining up regions in an 
active site (87). 

Protein Structures - The use of structural databases in similarity or commonality searching of protein
sequences and structures has been an area of extensive research during the past decade.
Applications include prediction of structure from sequence (homology- or knowledge-based), design
or engineering of proteins, and 3D model building for ligand design. Recent research has focused
on recognizing the structural information available in known structures or substructures.
Knowledge-based systems have been developed to access, compile, and analyze protein structure
properties such as residue features (e.g., side-chain rotamers), secondary structure (e.g., turn
motifs), and tertiary folds (88-90).

The similarity searching methods commonly applied to chemical structures are now finding 
their first applications in the field of protein structures. For example, a subgraph isomorphism 
algorithm based on a clique detection procedure has been applied to locate structural patterns in 
pairs of proteins (91). 
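
Clique detection for 3D pattern matching can be sketched on a correspondence graph. The sketch below is not the algorithm of reference 91: the choice of the basic Bron-Kerbosch procedure, the labeled-point representation, the distance tolerance, and all names are assumptions made for illustration.

```python
import math
from itertools import combinations


def correspondence_graph(a, b, tol=0.5):
    """a, b: lists of (label, (x, y, z)). Nodes pair points with equal labels;
    an edge joins two pairs when the corresponding distances agree within tol."""
    nodes = [(i, j) for i, (la, _) in enumerate(a)
             for j, (lb, _) in enumerate(b) if la == lb]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l:
            da = math.dist(a[i][1], a[k][1])
            db = math.dist(b[j][1], b[l][1])
            if abs(da - db) <= tol:
                adj[(i, j)].add((k, l))
                adj[(k, l)].add((i, j))
    return adj


def max_clique(adj):
    """Largest clique found by the basic Bron-Kerbosch enumeration (no pivoting)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        for v in list(p):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best


# Toy patterns: b contains a translated, reordered copy of a plus one extra point.
a = [("N", (0, 0, 0)), ("O", (3, 0, 0)), ("C", (0, 4, 0))]
b = [("C", (10, 4, 0)), ("N", (10, 0, 0)), ("O", (13, 0, 0)), ("S", (20, 20, 20))]

matching = sorted(max_clique(correspondence_graph(a, b)))
```

Each node of the clique is a pairing (index in a, index in b), so the largest clique is the largest set of pairings whose internal distances are mutually consistent, i.e., the maximal common 3D pattern under the stated tolerance. Enumeration without pivoting is exponential in the worst case and is only reasonable for small patterns.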

EMERGING TOPICS 

Clustering Methods - The speed at which 2D structures are being converted into 3D structures has 
led to an explosion of structural data records based on available databases such as the FCD, CAS, 
or other proprietary databases. This has led to the need to select a subset of 3D structures to be 
considered in a database query. Cluster analysis or automatic classification is the name given to a 
range of techniques for the grouping of multidimensional datasets. Although differing in their 
individual details, these methods are based on the grouping together, or clustering, of the most 
similar objects or pairs of objects. The similarities are typically calculated using nearest-neighbor 
searching routines (92). Various similarity searching strategies have been applied to identify an 
adequate subset of large databases for biological testing while retaining the assurance that 
compounds with novel shapes or properties have not been overlooked. 
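
One way nearest-neighbor lists can drive a clustering is sketched below. The source does not name a specific algorithm, so the Jarvis-Patrick-style rule used here (two items are joined when each appears in the other's neighbor list and they share a minimum number of neighbors) is an assumption, as are the descriptor vectors, parameter values, and function names.

```python
import math


def k_nearest(points, k):
    """For each point, the set of indices of its k nearest neighbours."""
    lists = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        lists.append(set(others[:k]))
    return lists


def jarvis_patrick(points, k=3, k_min=2):
    """Join i and j when they are mutual neighbours and share >= k_min neighbours."""
    nn = k_nearest(points, k)
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if j in nn[i] and i in nn[j] and len(nn[i] & nn[j]) >= k_min:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical two-component descriptors for eight compounds in two tight groups.
descriptors = [(0, 0), (1, 0), (0, 1), (1, 1),
               (10, 10), (11, 10), (10, 11), (11, 11)]
clusters = jarvis_patrick(descriptors)
```

Selecting one representative per cluster would then give a reduced subset for testing, with the neighbor-list length and the shared-neighbor threshold controlling how aggressively the database is condensed.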

A shape-based clustering method has recently been shown to be useful for reducing the 
size of a database while retaining the geometrical diversity (42). Timing studies on test databases
indicate that the method is applicable to large datasets. One advantage of the
distance method is that, for the same compound, conformational differences will be evident.
However, this method requires precomputing the conformations, which exacerbates the
problem of the already large overhead of data storage. The connectivity method clusters together
molecules of similar connectivity, without dependence on the conformation of the molecules, and
therefore may not be an adequate representation of shape in systems capable of extending over a
wide range of distances with a high degree of flexibility.

The application of the shape similarity approach, using atom triplets as descriptors, has also 
been used for rapid quantitative shape matching between two molecules or molecules and a 
template (93). The overall similarity is measured as a scoring factor between atom triplets.
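
A triplet-based shape comparison can be sketched as follows. The scoring used in reference 93 is not given here, so the binned side-length descriptor and the Tanimoto-like score on triplet counts are assumptions, as are the bin width and function names.

```python
import math
from collections import Counter
from itertools import combinations


def triplet_descriptor(coords, bin_width=0.5):
    """Count atom triplets by their sorted, binned triangle side lengths."""
    counts = Counter()
    for a, b, c in combinations(coords, 3):
        sides = sorted((math.dist(a, b), math.dist(b, c), math.dist(a, c)))
        counts[tuple(int(s / bin_width) for s in sides)] += 1
    return counts


def triplet_similarity(ca, cb):
    """Shared triplet count divided by the union count (Tanimoto-like, assumed)."""
    keys = set(ca) | set(cb)
    shared = sum(min(ca[k], cb[k]) for k in keys)
    total = sum(max(ca[k], cb[k]) for k in keys)
    return shared / total if total else 1.0


# Toy coordinates: a compact molecule, a translated copy, and a linear molecule.
compact = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (0, 0, 12)]
moved = [(5, 5, 5), (8, 5, 5), (5, 9, 5), (5, 5, 17)]
linear = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]

same_shape = triplet_similarity(triplet_descriptor(compact), triplet_descriptor(moved))
different_shape = triplet_similarity(triplet_descriptor(compact), triplet_descriptor(linear))
```

Because the descriptor depends only on internal distances, no prior superposition is needed: the translated copy scores 1.0 against the original, while the linear arrangement scores lower. Hard bin edges make the score sensitive to small coordinate changes near a boundary, which a production scheme would need to address.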

Conclusions - There has been considerable progress in the access and retrieval of 3D structural
database information to aid in finding and designing new biologically active compounds.
Three-dimensional database retrieval methods can already take into account the conformational
flexibility of molecules. Efficient similarity searching methods are now aiming toward direct
application to structure-activity studies and molecular design, whether at the level of small organic
molecules or protein structures. In the current docking methods, limited conformational flexibility,
when addressed, is restricted to the ligand molecules. In general, docking or de novo strategies
depend on the accuracy of the binding site. While the success of many of the methods depends on the
starting localization of the ligand with respect to the binding site, efficient sampling of the binding
protein within reasonable computational limits needs improvement. Future developments will no doubt
address these limitations.